Bayesian inference was initially developed by Reverend Thomas Bayes, but his ruminations on inverse probability wouldn’t be known until a friend polished and submitted his work to the Royal Society. Bayes’ work was eventually developed and refined by Laplace. Bayesian inference was wildly different from Fisher’s work in defining classical statistical theory (Lesaffre and Lawson 2012).
In opposition to the Bayesian approach is the frequentist approach. The frequentist approach considers the parameter of interest fixed and inference on the parameter is based on the result of repeated sampling. In the Bayesian approach, the parameter of interest is not fixed but stochastic, and inference on the parameter is based on observed data and prior knowledge (Lesaffre and Lawson 2012).
A benefit of the Bayesian approach lies in the ability to include prior knowledge through the selection of a prior. Priors can be subjective or objective. Subjective priors incorporate the opinions of an individual or a group, which can negatively impact the perceived integrity of the findings. Objective priors, which follow formal rules for determining uninformative priors, are generally preferred (Lesaffre and Lawson 2012).
When prior knowledge is limited or otherwise insufficient, an uninformative prior may be used. A common choice for an uninformative prior, in cases with a binomial likelihood, is a uniform distribution, also known as a flat prior. In cases with a Gaussian likelihood, the uninformative prior would be a normal prior with \(\sigma^2 \to \infty\), which functions similarly to the flat prior. For cases with a Poisson likelihood, a Gamma(\(\alpha_0\), \(\beta_0\)) prior is used, where the sum of the counts is \((\alpha_0 - 1)\) and the experiment size is \(\beta_0\) (Lesaffre and Lawson 2012). For normal linear regression models, conjugate normal-gamma priors can be used to provide a resulting posterior of the same family (Yan and Su 2009).
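Conjugate updating of this kind reduces to simple parameter arithmetic. The sketch below uses purely hypothetical counts (none of these numbers come from the sources above) to illustrate the beta-binomial and gamma-Poisson cases:

```r
# Flat Beta(1, 1) prior with a binomial likelihood yields a Beta posterior.
y <- 12; n <- 20                 # hypothetical successes and trials
post_a <- 1 + y                  # Beta(1 + y, 1 + n - y) posterior
post_b <- 1 + n - y
post_mean <- post_a / (post_a + post_b)

# A Poisson likelihood with a Gamma(alpha0, beta0) prior updates the same way:
counts <- c(3, 5, 2, 4)          # hypothetical Poisson counts
alpha0 <- 2; beta0 <- 1          # hypothetical prior hyperparameters
post_shape <- alpha0 + sum(counts)
post_rate  <- beta0 + length(counts)
```

In both cases the posterior is in the same family as the prior, which is exactly the convenience conjugacy buys.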
There are a variety of ways to summarize the posterior in order to derive conclusions about the parameter. Its location and variability can be specified by finding the mode, mean, and median of the distribution. Its range can be defined with the equal-tail credible interval (not to be confused with the confidence interval of the frequentist approach) or with the highest posterior density (HPD) interval. Future predictions for the parameter can be made through posterior predictive distributions (PPD), which integrate out \(\theta\) under the assumption that past data will not influence future data, known as hierarchical independence (Lesaffre and Lawson 2012).
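Given draws from a posterior, these summaries are straightforward to compute. A minimal sketch, assuming an illustrative Beta posterior; the `hpd` helper is a naive sample-based search written for this example, not a library function:

```r
set.seed(1)
draws <- rbeta(1e5, 13, 9)               # draws from an illustrative posterior

post_mean   <- mean(draws)               # location summaries
post_median <- median(draws)

# 95% equal-tail credible interval: cut 2.5% from each tail
et_ci <- quantile(draws, c(0.025, 0.975))

# 95% HPD interval: the narrowest interval holding ~95% of the draws
hpd <- function(x, prob = 0.95) {
  x <- sort(x)
  k <- floor(prob * length(x))
  widths <- x[(k + 1):length(x)] - x[1:(length(x) - k)]
  i <- which.min(widths)
  c(x[i], x[i + k])
}
hpd_ci <- hpd(draws)
```

For a symmetric posterior the two intervals nearly coincide; for skewed posteriors the HPD interval is the shorter of the two.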
It is not uncommon for the marginal posterior density function of a parameter to be unavailable, requiring an alternate approach to extract insight. It is safe to assert that sampling techniques are used in nearly all Bayesian analyses (Yan and Su 2009). General purpose sampling algorithms include the inverse CDF method, the accept-reject algorithm, importance sampling, and the commonly used Monte Carlo integration. Monte Carlo integration replaces the integral of the posterior with the average obtained from randomly sampled values to provide an expected value. Two popular Markov chain Monte Carlo (MCMC) methods are the Gibbs sampler and the Metropolis-Hastings (MH) algorithm (Lesaffre and Lawson 2012).
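Both ideas can be illustrated in a few lines. The toy example below (the Beta target is illustrative, not from the sources) uses Monte Carlo integration to estimate a posterior mean, then draws from the same density with a minimal Metropolis sampler, the symmetric-proposal special case of MH:

```r
# Monte Carlo integration: replace the integral defining E(theta)
# with the average of randomly sampled values.
set.seed(42)
theta <- rbeta(1e5, 13, 9)   # draws from an illustrative Beta "posterior"
mc_mean <- mean(theta)       # estimate of E(theta); analytic value is 13/22

# Toy Metropolis sampler targeting the same Beta(13, 9) density.
mh <- function(n, init = 0.5, prop_sd = 0.1) {
  out <- numeric(n)
  cur <- init
  for (i in seq_len(n)) {
    prop <- cur + rnorm(1, 0, prop_sd)   # symmetric random-walk proposal
    # Accept with probability min(1, target(prop) / target(cur));
    # proposals outside (0, 1) have density zero and are always rejected.
    if (prop > 0 && prop < 1 &&
        runif(1) < dbeta(prop, 13, 9) / dbeta(cur, 13, 9)) {
      cur <- prop
    }
    out[i] <- cur
  }
  out
}
chain <- mh(20000)
```

After discarding an initial burn-in, the chain's average approximates the same expectation as the direct Monte Carlo estimate.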
1.1 Applications
Kong et al. (2020) explored the use of BLR for surface roughness prediction in the milling process of wood. Input was taken from the time domain of 3 different vibration signals to quantify surface roughness; tool wear monitoring was also taken into account.
Modifications to the standard BLR model were utilized to compare the accuracy of results in 4 different regressions:
Standard BLR was found to be the superior of the four and to outperform even partial least squares (PLS), artificial neural networks (ANN), and support vector machines (SVM). BLR is also beneficial in that it produces a credible interval (CI), which is advantageous for quality control (Kong et al. 2020).
Fraza et al. (2021) propose a warped BLR model that would solve the issue of non-Gaussian distributions using a “likelihood warping technique”. Gaussian process regression has historically been used in the analysis of neuroimaging but is problematic because the method assumes a Gaussian distribution, which is not the best fit for each modality being assessed. Gaussian process regression is also insufficient to handle emerging big data. The study acquired imaging data from the UK Biobank imaging dataset, specifically using image-derived phenotypes (IDPs) for their prevalence and diffusion tensor imaging (DTI) data for their likelihood to have non-Gaussian and non-linear trends. The warped BLR model was found to outperform both standard and Gaussian process regression models in terms of fit and scalability for the larger data sets (Fraza et al. 2021).
Zhang et al. (2022) used k-means clustering and standard BLR to create a performance indicator to monitor the health of a bridge. Data was acquired from weigh-in-motion (WIM) and structural health monitoring (SHM) systems from a concrete box girder bridge; these two systems are typically used independently but were both integrated into the model. The study identifies the slope of the regression, deemed the model slope indicator (MSI), for vehicle load (from the WIM system) and vehicle induced strain (from the SHM system) as a performance indicator for the bridge where a change in the regression would indicate a performance or structural issue. Metropolis-Hastings (M-H) sampling was the chosen sampling method to help solve problems in the data with noise, abnormal signals, and missing and omitted data. Three BLR models were made based on three different classes of vehicle weight and associated bridge strain that were determined through k-means clustering.
The threshold for change in MSI was set to 5% and hypothesis testing was used to verify its significance. The model would not trigger a warning when the change in MSI was lower than 5%, as this was an acceptable damage level for the concrete bridge used in this study (Zhang et al. 2022).
Li et al. (2023) proposed the use of BLR with a Cauchy prior (BLRC) for the analysis of multiple-input multiple-output (MIMO) radar systems with improved resolution by sparse array (SPA) designs. The researchers compared BLRC to the more traditional approaches of Cauchy-Gaussian (CG) and sparse Bayesian learning (SBL). The frameworks for the traditional approaches were found to be insufficient. CG uses the maximum a posteriori (MAP) framework, which is inefficient because optimal hyperparameters can only be found via trial and error. SBL was superior to CG but was found to have room for improvement in cases where input was from SPA. BLRC was found to be superior to SBL in a few ways. BLRC uses a long-tailed Cauchy distribution, resulting in the need for only two hyperparameters, whereas SBL has many. BLRC also outperforms SBL with higher-resolution radar images, better handling of noise, and greater flexibility of the system (Li et al. 2023).
Yang et al. (2019) aimed to improve the prediction of storm wind speed forecasts with the use of gridded Bayesian linear regression (GBLR). Historical storm data, a database of 92 storms in the Northeast US, was used to adjust predictions for the National Center for Atmospheric Research (NCAR) real-time dynamic ensemble prediction system (EPS). The GBLR approach involved inverse distance weighting to interpolate regression coefficients, transitioning from station locations to a model grid, and a novel “implicit” method wherein the BLR optimizes regression coefficients for the ensemble members collectively. GBLR was found to reduce bias in wind speed forecasts and to improve \(R^2\) and RMSE values when compared to the ensemble mean for global and event-based metrics. Similar improvements were seen in seasonal analyses. However, GBLR struggled with extreme values and in areas of the grid with sparse observations (Yang, Astitha, and Schwartz 2019).
2 Methods
2.1 The Frequentist Framework
Linear regression can be achieved using a variety of methods, two of interest being the frequentist and the Bayesian. The frequentist approach to linear regression is the more familiar approach. It estimates the effects of independent variables (predictors) on dependent variables (the outcome). The regression coefficient is a point estimate, assumed to be a fixed value. Following is the frequentist linear model

\[
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).
\]
The Bayesian approach estimates the relationship between predictors and an outcome in a similar way; however, its regression coefficient is not a point estimate but a distribution. That is, the regression coefficient is not assumed to be a fixed value. The Bayesian approach also goes a step further than frequentist regression in its inclusion of prior data. The Bayesian approach is so named because it is based on Bayes’ rule, which is written as follows:

\[
p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)}
\]
The normalization constant (\(p(y)\) above) ensures the posterior distribution is a valid distribution, but the posterior density function can be written without this constant. The resulting prediction is not a point estimate, but a distribution (Bayes 1763). The Bayesian approach is derived with Bayes’ theorem, wherein the posterior distribution, the updated belief about the parameter given the data \(p(\theta|y)\), is proportional to the likelihood of \(\theta\) given \(y\), \(L(\theta|y)\), and the prior density of \(\theta\), \(p(\theta)\). The former is known as the likelihood function and comprises the new data for analysis, while the latter allows for the incorporation of prior knowledge regarding \(\theta\) (Yan and Su 2009).
\[
f(\theta|D) \propto L(\theta|D)f(\theta)
\]
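This proportionality can be made concrete with a small grid approximation (a hypothetical sketch with toy data: a binomial likelihood and a flat prior). The unnormalized posterior is evaluated pointwise and then normalized so it sums to one:

```r
# Grid approximation of f(theta | D) proportional to L(theta | D) f(theta)
theta <- seq(0.001, 0.999, by = 0.001)        # grid over the parameter
prior <- rep(1, length(theta))                # flat prior
lik   <- dbinom(12, size = 20, prob = theta)  # toy data: 12 successes in 20 trials
post  <- lik * prior
post  <- post / sum(post)                     # normalize (the p(y) step)
theta[which.max(post)]                        # posterior mode; with a flat prior
                                              # this matches the MLE, 0.6
```

With a flat prior the posterior is simply the renormalized likelihood, which is why its mode lands on the maximum likelihood estimate.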
2.3 The Models
To generate a model for our analysis, we start with the normal data model \(Y_i|\beta_0, \beta_1, \sigma \sim N(\mu, \sigma^2)\) and include the mean specific to our predictor, departure time, \(\mu_i\). The model is
The categorical predictor, Day of the Week, is given a similar treatment to the continuous predictor with the sampling of three Normal models with different priors. Tuesday is set to be the reference as it has the lowest mean delay.
The general formula is adjusted for the inclusion of regression coefficients for each day of the week \(\beta_1, \beta_2, ..., \beta_6\), where the intercept coefficient \(\beta_0\) reflects Tuesday’s arrival delay. Only one prior for the predictor must be specified as the stan_glm() function handles categorical variables in a similar way to the glm() function, through dummy coding.
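Dummy coding can be previewed with base R's model.matrix() on toy data (the values below are illustrative, not the study's): each non-reference level receives its own 0/1 indicator column, and rows at the reference level are all zeros, so the intercept absorbs the reference level's mean.

```r
# Toy factor with Tuesday releveled as the reference, mirroring how
# stan_glm() (like glm()) dummy-codes a categorical predictor.
days <- factor(c("Monday", "Tuesday", "Wednesday", "Tuesday"))
days <- relevel(days, ref = "Tuesday")
X <- model.matrix(~ days)
X
# One intercept column plus one indicator per non-reference level;
# the Tuesday rows have zeros in every indicator column.
```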
2.4 Prior Selection
Since we are only using two continuous variables for the first model, arrival delay and departure time, the regression parameters will be \(\beta_0\), \(\beta_1\), and \(\sigma\) for intercept, slope, and error. As the intercept and slope regression parameters can take any real value, we will use Normal prior models (Johnson, Ott, and Dogucu 2022):

\[
\beta_{0} \sim N(m_0, s_0^2), \qquad \beta_{1} \sim N(m_1, s_1^2),
\]
where \(m_0, s_0, m_1, \text{and } s_1\) are hyperparameters.
The standard deviation parameter must be positive, so we will use an exponential model (Johnson, Ott, and Dogucu 2022).
\[
\sigma \sim \text{Exp}(l)
\]
Because the exponential model is a special case of the Gamma model, with \(s = 1\), we can use the definitions of the mean and variance of the Gamma model to find those of the exponential model (Johnson, Ott, and Dogucu 2022).
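Concretely, a Gamma(\(s\), \(r\)) model has mean \(s/r\) and variance \(s/r^2\); setting \(s = 1\) (so the rate is \(l\)) gives the exponential case:

\[
E(\sigma) = \frac{1}{l}, \qquad \text{Var}(\sigma) = \frac{1}{l^2}.
\]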
This analysis will explore the effect of different priors on two models. Traditionally, priors are tuned by selecting values for the aforementioned hyperparameters based on prior knowledge about the data being modeled; these hyperparameters are specific to the distributions of the priors. In this study, the mean and standard deviation are estimated from the data in order to tune the Normal priors. In addition to the model with manually tuned priors, two other versions of each model will be constructed: one using priors tuned by the modeling function (“Default Tuned Priors”) and one using flat priors.
Tuned Continuous Model
\(\beta_0\) informs the model intercept
Code
summary(Delays_sample$DEP_TIME_MINS) # mean departure time is 809.3 minutes (~1:30pm)
Delays_sample_filtered_B0 <- subset(Delays_sample, DEP_TIME_MINS >= 800 & DEP_TIME_MINS <= 820)
mean(Delays_sample_filtered_B0$ARR_DELAY) # m_0c = 2
sd(Delays_sample_filtered_B0$ARR_DELAY) # s_0c = 36
\(\beta_{0c}\) reflects the typical arrival delay at a typical departure time. With a mean departure time at \(\sim\) 1:30pm, the average arrival delay is \(\sim\) 2 minutes with a standard deviation \(\sim\) 36 minutes.
The slope of the linear model indicates a 0.019-minute increase in arrival delay per minute increase in departure time, so we set \(m_1 = 0.02\). The standard error of 0.0005 reflects high confidence, but so as not to overly constrain the model we will use a wider \(s_1 = 0.01\).
\[
\beta_{1} \sim N(0.02, 0.01^2)
\]
\(\sigma\) informs the regression standard deviation
Code
summary(lm_model)$sigma
To tune the exponential model, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error, \(\sim 50\). Since \(E(\sigma) = 1/l\), the rate parameter is \(l = 1/50 = 0.02\).
For arrival delays by the day of the week, mean arrival delays are between 1 and 7 minutes while the median arrival delays are all in the negative, indicating a skew towards larger delays.
\(\beta_{0}\) reflects the mean arrival delay on Tuesday, our reference. The average arrival delay is \(\sim\) 2 minutes with a standard deviation \(\sim\) 46 minutes.
\[
\beta_{0} \sim N(2, 46^2)
\]
\(\beta_j\) informs the model slopes
For a categorical predictor with the stan_glm() function, the tuned prior, \(\beta_j\), is applied to the estimation of each coefficient associated with the individual levels of the predictor (\(\beta_1, \beta_2, \ldots, \beta_6\)). For this reason, we set the coefficient prior to be weakly informative.
\[
\beta_{j} \sim N(0, 50^2)
\]
\(\sigma\) informs the regression standard deviation
Code
lm_model <- lm(ARR_DELAY ~ DAY_OF_WEEK, data = Delays_sample)
summary(lm_model)$sigma
To tune the exponential model, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error which is the same as with the previous model, \(\sim 50\).
The Bayesian Normal regression model relies on three assumptions that are not dissimilar to those of frequentist linear regression. Primarily, each observation should be independent of any other observation. The relationship between the outcome and predictor variables is expected to be linear. It is also assumed that the variability of the observed data is normally distributed around its mean with a consistent standard deviation across all levels of the predictor (homoscedasticity) (Johnson, Ott, and Dogucu 2022).
2.7 Statistical Programming
Data was collected by the Bureau of Transportation Statistics (BTS) and accessed through a data set compiled by Patrick Zelazko (Zelazko 2023). The data was imported into R (R Core Team 2023) via CSV. This is a large time-series data set with 3 million observations, each a specific flight, and 32 features. The data is from flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the time in minutes and attributed reasons for delay. The stan_glm() function from the “rstanarm” library (Brilleman et al. 2018) was used for simulation of the Normal Bayesian linear regression model. This function also runs the Markov chain Monte Carlo simulation with specified chains and iterations and the ability to set a seed; these were set to 4 chains, 2000 iterations, and a seed of 123. Simulation of the posterior was done with the posterior_predict() function, also from the “rstanarm” library (Brilleman et al. 2018). Evaluation of the model was done by considering the data and its source, the assumptions of the model, and the accuracy of the prediction. The posterior predictions were evaluated via k-fold cross validation with the prediction_summary_cv() function from the “bayesrules” library (Dogucu, Johnson, and Ott 2021). This provided the median absolute error (MAE), scaled MAE, and the proportion of values that fall within the 50% and 95% posterior prediction intervals. Effective sample size (\(N_\text{eff}\)) and R-hat (\(\hat{R}\)) were also calculated using the “bayesplot” package (Gabry et al. 2019).
3 Analysis
3.1 The Dataset
This is a large data set with 3 million observations, each a specific flight, and 32 features. The data is from flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the time in minutes and attributed reasons for delay. Following are the definitions of the given variables in this data set.
| Header | Description |
|---|---|
| Fl Date | Flight Date (yyyy-mm-dd) |
| Airline | Airline Name |
| Airline DOT | Airline Name and Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years. |
| Airline Code | Unique Carrier Code |
| DOT Code | An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. |
| Fl Number | Flight Number |
| Origin | Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Origin City | Origin City Name, State Code |
| Dest | Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Dest City | Destination City Name, State Code |
| CRS Dep Time | CRS Departure Time (local time: hhmm) |
| Dep Time | Actual Departure Time (local time: hhmm) |
| Dep Delay | Difference in minutes between scheduled and actual departure time. Early departures show negative numbers. |
| Taxi Out | Taxi Out Time, in Minutes |
| Wheels Off | Wheels Off Time (local time: hhmm) |
| Wheels On | Wheels On Time (local time: hhmm) |
| Taxi In | Taxi In Time, in Minutes |
| CRS Arr Time | CRS Arrival Time (local time: hhmm) |
| Arr Time | Actual Arrival Time (local time: hhmm) |
| Arr Delay | Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers. |
| Cancelled | Cancelled Flight Indicator (1=Yes) |
| Cancellation Code | Specifies The Reason For Cancellation |
| Diverted | Diverted Flight Indicator (1=Yes) |
| CRS Elapsed Time | CRS Elapsed Time of Flight, in Minutes |
| Actual Elapsed Time | Elapsed Time of Flight, in Minutes |
| Air Time | Flight Time, in Minutes |
| Distance | Distance between airports (miles) |
| Carrier Delay | Carrier Delay, in Minutes |
| Weather Delay | Weather Delay, in Minutes |
| NAS Delay | National Air System Delay, in Minutes |
| Security Delay | Security Delay, in Minutes |
| Late Aircraft Delay | Late Aircraft Delay, in Minutes |
Table 1 shows the flights distinguished by a new variable, Flight Period, where each time period comprises an 8-hour segment (i.e., Morning has flights with departure times from 4am to noon, followed by Afternoon and Evening). The Afternoon period comprises the most flights (47.4%), followed closely by the Morning period (41.5%), and the Evening period trails the two (11%). The table also gives the means of the departure and arrival times, giving an indication of the density of the flights in the given period. The average departure and arrival delays show much better numbers for the Morning period (5.23, -0.77 minutes), with increasing delays for the Afternoon and Evening periods. The delay counts by type show that the Afternoon and Morning periods account for significantly more of the total delays, though that is without accounting for the Evening period's smaller share of flights overall.
Table 1: Flight Delay Summary by Flight Period

| Flight Period | Morning | Afternoon | Evening | Total |
|---|---|---|---|---|
| Total Flights Count | 1246031 (41.5%) | 1423140 (47.4%) | 330829 (11.0%) | 3000000 (100%) |
| Cancelled Flights Count | 30690 (38.8%) | 38343 (48.4%) | 10107 (12.8%) | 79140 (100%) |
| Diverted Flights Count | 2555 (36.2%) | 3901 (55.3%) | 600 (8.5%) | 7056 (100%) |
| Avg CRS Dep Time | 08:49:31 | 15:73:19 | 20:66:23 | 13:27:04 |
| Avg Dep Time | 08:53:58 | 15:89:05 | 20:12:40 | 13:29:47 |
| Avg Dep Delay | 5.23 | 12.93 | 16.51 | 10.12 |
| Avg Taxi Out | 16.87 | 16.44 | 16.65 | 16.64 |
| Avg Taxi In | 7.75 | 7.78 | 6.95 | 7.68 |
| Avg CRS Arr Time | 10:87:15 | 17:85:11 | 17:42:14 | 14:90:34 |
| Avg Arr Time | 10:86:01 | 17:71:56 | 15:89:47 | 14:66:31 |
| Avg Arr Delay | -0.77 | 7.34 | 10.04 | 4.26 |
| Avg Air Time | 114.12 | 109.8 | 116.31 | 112.31 |
| Carrier Delay Count | 86824 (29.2%) | 162266 (54.6%) | 47861 (16.1%) | 296951 (100%) |
| Security Delay Count | 887 (32.1%) | 1434 (52.0%) | 438 (15.9%) | 2759 (100%) |
| Weather Delay Count | 8380 (26.7%) | 18758 (59.7%) | 4290 (13.7%) | 31428 (100%) |
| NAS Delay Count | 80604 (31.4%) | 144366 (56.3%) | 31507 (12.3%) | 256477 (100%) |
| Late Aircraft Delay Count | 42721 (16.5%) | 168902 (65.2%) | 47391 (18.3%) | 259014 (100%) |

Summary includes morning, afternoon, and evening flight periods.
3.2 Exploratory Data Analysis
The histograms in Figure 1 illustrate the frequencies of air time, arrival delays, and departure delays. The y-axis was transformed to make the visualizations more legible. All show a skew to the right. This makes sense for air times with a higher proportion of regional flights and the exclusion of international departures and arrivals. Shorter delays (for both arrivals and departures) being more frequent than longer delays is also to be expected.
Figure 2 shows the average arrival delay for the largest five airlines (filtered for carriers with over 200,000 flights in the given period). The standard deviations for these airlines are fairly small, indicating a low variability in the arrival delays for these airlines.
Figure 3 displays the average arrival delay for flights at their origin airport using the “plotly” (Sievert 2020) and “ggmap” (Kahle and Wickham 2013) packages. This comes from the idea that a flight’s arrival delay is collinear with its departure delay, which may be influenced by its origin airport. Airport location information was sourced from Random Fractals Inc (Novak 2023).
Figures 4 and 5 show the correlation of the continuous variables and the correlation of the categorical and continuous variables, respectively, using the “corrplot” package (Wei and Simko 2024). Figure 4 doesn’t provide any particularly valuable insight. Arrival delay has an expected positive correlation coefficient with departure delay. There also appears to be a slight positive correlation between arrival delay and both taxi times and elapsed time, which is expected.
Figure 5 shows the correlation of the continuous and categorical variables using the p-values from the t-tests and ANOVAs of each relation. Relationships with at least one significant p-value are shown in dark blue. This heatmap is of limited value: there appear to be many significant correlations, but without further inspection it is not known which value(s) of the categorical variables have a significant p-value.
Code
p_value_long <- melt(p_value_results, na.rm = TRUE) # Remove NAs
ggplot(data = p_value_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "blue", high = "red", na.value = "gray90", name = "p-value") +
  labs(x = "Categorical Variable",
       y = "Continuous Variable",
       title = "Figure 5. Heatmap of p-values for categorical vs continuous variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0),
        plot.title.position = "plot")
3.3 Data Preprocessing
To prepare the data for analysis, a few changes were made. Diverted and cancelled flights were removed, as both have arrival delays set to “NA” because they did not arrive as scheduled. Including cancelled and diverted flights would introduce complexity and could lead to bias in the model; the goal of the model is to understand typical arrival delays, so their inclusion is unnecessary. The data set is also quite large at three million observations, so it was sampled for computational efficiency. For the first model, the departure time variable was converted from an “HHMM” format to a “minutes past midnight” format for use by the stan_glm() function.
ggplot(Delays_sample, aes(x = DEP_TIME)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(x = "Departure Time (time HHMM)",
       y = "Frequency",
       title = "Figure 6. Distribution of departure time") +
  xlim(NA, 2400) +
  theme_minimal()
Code
convert_to_minutes <- function(time) {
  hour <- time %/% 100
  minute <- time %% 100
  total_minutes <- hour * 60 + minute
  return(total_minutes)
}
Delays_sample <- Delays_sample %>%
  mutate("DEP_TIME_MINS" = sapply(DEP_TIME, convert_to_minutes))
Code
ggplot(Delays_sample, aes(x = DEP_TIME_MINS)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(title = "Figure 7. Distribution of adjusted departure time",
       x = "Departure Time (mins past midnight)",
       y = "Frequency") +
  xlim(NA, 1440) +
  theme_minimal()
For the second model, “Day of the Week” was the chosen predictor variable which was created using the “lubridate” library(Grolemund and Wickham 2011).
ggplot(Delays_sample, aes(x = DAY_OF_WEEK, y = ARR_DELAY)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = mean_delay_by_day,
             aes(x = DAY_OF_WEEK, y = mean_arr_delay),
             color = "red4", size = 3, shape = 8) +
  labs(title = "Figure 9. Arrival delay by the day of the week.",
       x = "Day of the Week",
       y = "Arrival Delay (minutes)") +
  theme_minimal() +
  ylim(-45, 45)
3.4 Modeling
3.4.1 Bayesian Normal Regression: Departure Time Predictor
The posterior median relationship for the first model with flat priors is
\[
-10.92 + 0.02X.
\]
That is, at \(0 \text{ minutes}\) (i.e. Midnight), the expected delay is \(-10.92\). For every minute past midnight, the delay is expected to increase by \(0.02 \text{ minutes}\) (\(1.2 \text{ seconds}\)). At noon (\(X = 720 \text{ minutes}\)), the expected delay for any flight is \(3.48 \text{ minutes}\) (\(\sim 3 \text{ minutes and } 29\text{ seconds}\)).
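The arithmetic can be checked directly (a throwaway R snippet evaluating the posterior median line, not part of the fitted model):

```r
# Posterior median line: expected delay (minutes) for a departure time
# given in minutes past midnight
delay_at <- function(mins) -10.92 + 0.02 * mins
delay_at(0)    # -10.92 minutes at midnight
delay_at(720)  # 3.48 minutes at noon
```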
It is important to note that the uniform prior distributions used here are described as “not non-informative” in the documentation, as they assign the same probability to all values, whether plausible or not.
library(rstanarm)
tuned_model_dt <- stan_glm(
  ARR_DELAY ~ DEP_TIME_MINS, data = Delays_sample,
  family = gaussian(),
  prior_intercept = normal(2, 1296),
  prior = normal(0.02, 0.0001),
  prior_aux = exponential(0.02),
  chains = 4, iter = 2000, seed = 123,
  refresh = 0
)

library(broom.mixed)
tmdt <- tidy(tuned_model_dt, effects = c("fixed", "aux"),
             conf.int = TRUE, conf.level = 0.95)

# Store the 4 chains for each parameter in 1 data frame
model_tuned_df <- as.data.frame(tuned_model_dt)

## TUNED
# Simulate a set of predictions
set.seed(123)
shortcut_prediction <- posterior_predict(
  tuned_model_dt, newdata = data.frame(DEP_TIME_MINS = 720))
ggarrange(mcmc_dens(shortcut_prediction) +
  xlab("Delay (minutes)") +
  ggtitle("Figure 12 (a) Predicted Arrival Delays for a Departure Time of Noon, Tuned Priors"))
Code
model <- tuned_model_dt
plotInt <- mcmc_trace(model, pars = "(Intercept)") +
  labs(title = "Figure 12 (b) MCMC Trace", subtitle = "Intercept")
plotBeta <- mcmc_trace(model, pars = "DEP_TIME_MINS") +
  labs(subtitle = "Departure Time (mins)")
plotSig <- mcmc_trace(model, pars = "sigma") +
  labs(subtitle = "Sigma")
ggarrange(plotInt, plotBeta, plotSig, ncol = 1, align = "v", heights = c(4, 4, 4))
That is, on Tuesday one can expect a delay of \(1.57\) minutes (\(1\) minute and \(34\) seconds). One can expect the longest delay on Wednesday at \(6\) minutes and \(29\) seconds.
This version of the second model showed little change for the expectations of Tuesday through Thursday, but the rest of the week had shifts of more than a minute for all days except Sunday with a shift of \(53\) seconds.
The tuned version of the second model showed coefficients more similar to that of the flat model.
3.5 Modeling Results
A total of 100,000 observations were included in the model creation for this study. As arrival delay was the outcome of interest, cancelled and diverted flights were excluded from this study owing to missing values and extenuating circumstances that would introduce bias. Following, in Table 2, is a summary of the regression coefficients, their standard errors, and 95% credible intervals from the Bayesian regression.
Table 2. Estimations of the Posterior Distributions’ Regression Coefficients.

| | Mean | SE | 95% CI Lower |
|---|---|---|---|
| **Model 1: Continuous Predictor** | | | |
| Flat Priors | | | |
| 𝛽₀ Intercept | -10.92 | 0.47 | -11.85 |
| 𝛽₁ Departure Time | 0.02 | 0.00 | 0.02 |
| 𝜎 | 51.09 | 0.12 | 50.86 |
| Default Tuned Priors | | | |
| 𝛽₀ Intercept | -10.94 | 0.47 | -11.86 |
| 𝛽₁ Departure Time | 0.02 | 0.00 | 0.02 |
| 𝜎 | 51.09 | 0.12 | 50.86 |
| Tuned Priors | | | |
| 𝛽₀ Intercept | -11.66 | 0.17 | -12.02 |
| 𝛽₁ Departure Time | 0.02 | 0.00 | 0.02 |
| 𝜎 | 51.10 | 0.12 | 50.87 |
| **Model 2: Categorical Predictor** | | | |
| Flat Priors | | | |
| 𝛽₀ Intercept (Tuesday) | 1.57 | 0.44 | 0.69 |
| 𝛽₁ Wednesday | 4.92 | 0.61 | 3.72 |
| 𝛽₂ Thursday | 2.66 | 0.65 | 1.47 |
| 𝛽₃ Friday | 1.86 | 0.62 | 0.69 |
| 𝛽₄ Saturday | 3.43 | 0.61 | 2.25 |
| 𝛽₅ Sunday | 4.36 | 0.61 | 3.21 |
| 𝛽₆ Monday | 2.93 | 0.65 | 1.72 |
| 𝜎 | 51.39 | 0.12 | 51.16 |
| Default Tuned Priors | | | |
| 𝛽₀ Intercept (Tuesday) | 1.54 | 0.40 | 0.73 |
| 𝛽₁ Wednesday | 4.40 | 0.60 | 3.21 |
| 𝛽₂ Thursday | 2.69 | 0.60 | 1.48 |
| 𝛽₃ Friday | 2.97 | 0.64 | 1.75 |
| 𝛽₄ Saturday | 4.96 | 0.59 | 3.77 |
| 𝛽₅ Sunday | 3.48 | 0.58 | 2.31 |
| 𝛽₆ Monday | 1.89 | 0.60 | 0.74 |
| 𝜎 | 51.40 | 0.12 | 51.17 |
| Tuned Priors | | | |
| 𝛽₀ Intercept (Tuesday) | 1.54 | 0.44 | 0.66 |
| 𝛽₁ Wednesday | 4.96 | 0.64 | 3.76 |
| 𝛽₂ Thursday | 2.70 | 0.61 | 1.48 |
| 𝛽₃ Friday | 1.91 | 0.64 | 0.71 |
| 𝛽₄ Saturday | 3.50 | 0.63 | 2.28 |
| 𝛽₅ Sunday | 4.41 | 0.61 | 3.23 |
| 𝛽₆ Monday | 2.99 | 0.64 | 1.73 |
| 𝜎 | 51.39 | 0.12 | 51.17 |
The three versions of Model 1, with departure time as predictor, agree on a \(\beta_1\) coefficient of \(0.02\) indicating that the delay is expected to increase by \(0.02 \text{ minutes}\) (\(1.2 \text{ seconds}\)) for each minute past midnight. There is some discrepancy with the intercept term \(\beta_0\) with values ranging from \(-10.92\) to \(-11.66\text{ minutes}\). Standard deviation is large comparatively at \(\sim\) 51.09 minutes.
There is more variability apparent in the summary of Model 2, though the intercept term is consistent among the three versions of the model at \(1.54\) to \(1.57 \text{ minutes}\) (a matter of \(1.8 \text{ seconds}\)). Daily estimated delays range from \(1.86\) to \(4.96 \text{ minutes}\). Again, the standard deviation is comparatively large at \(\sim 51.39 \text{ minutes}\).
3.6 K-Fold Cross Validation
Cross validation was done to provide an estimation of model performance. The provided posterior predictive summary is comprised of the median absolute error (MAE) which measures the typical difference between observed and posterior predictive means, the scaled MAE (scaled_MAE) which scales MAE by standard deviations, and the proportion of observations that fall within the \(50\%\) and \(95\%\) posterior prediction intervals. The cross validation assessment (prediction_summary_cv()) was chosen over the posterior prediction summary assessment (prediction_summary()), both from the “bayesrules” package(Dogucu, Johnson, and Ott 2021), as the cross validation procedure provides a more conservative estimate of model performance. Cross validation can also be more computationally efficient with a lower number of folds and the use of a smaller data set.
Code
library(bayesrules)

# Posterior Predictive Summary
# set.seed(123)
# ps_fmdt <- prediction_summary(flat_model_dt, data = Delays_sample)
# set.seed(123)
# ps_dmdt <- prediction_summary(default_model_dt, data = Delays_sample)
# set.seed(123)
# ps_tmdt <- prediction_summary(tuned_model_dt, data = Delays_sample)
# ps_fmdt <- ps_fmdt %>% mutate(Model = "fmdt")
# ps_dmdt <- ps_dmdt %>% mutate(Model = "dmdt")
# ps_tmdt <- ps_tmdt %>% mutate(Model = "tmdt")
# ps_results <- bind_rows(ps_fmdt, ps_dmdt, ps_tmdt)
# print(ps_results)

# K-fold Cross Validation with k=5, data n=1000
set.seed(123)
sample_size <- 1000
Delays_sample2 <- Delays_sample %>% sample_n(sample_size)
set.seed(123)
cv_fmdt <- prediction_summary_cv(model = flat_model_dt, data = Delays_sample2, k = 5)
set.seed(123)
cv_dmdt <- prediction_summary_cv(model = default_model_dt, data = Delays_sample2, k = 5)
set.seed(123)
cv_tmdt <- prediction_summary_cv(model = tuned_model_dt, data = Delays_sample2, k = 5)
cv_fmdt <- cv_fmdt$cv %>% mutate(Model = "fmdt")
cv_dmdt <- cv_dmdt$cv %>% mutate(Model = "dmdt")
cv_tmdt <- cv_tmdt$cv %>% mutate(Model = "tmdt")
cv_results <- bind_rows(cv_fmdt, cv_dmdt, cv_tmdt)
print(cv_results)
Code
# Posterior Predictive Summary
# set.seed(123)
# ps_fmdow <- prediction_summary(flat_model_dow, data = Delays_sample)
# set.seed(123)
# ps_dmdow <- prediction_summary(default_model_dow, data = Delays_sample)
# set.seed(123)
# ps_tmdow <- prediction_summary(tuned_model_dow, data = Delays_sample)
#
# ps_fmdow <- ps_fmdow %>% mutate(Model = "fmdow")
# ps_dmdow <- ps_dmdow %>% mutate(Model = "dmdow")
# ps_tmdow <- ps_tmdow %>% mutate(Model = "tmdow")
#
# ps_results2 <- bind_rows(ps_fmdow, ps_dmdow, ps_tmdow)
# print(ps_results2)

# K-fold cross validation with k = 5 on a subsample of n = 1000
set.seed(123)
sample_size <- 1000
Delays_sample2 <- Delays_sample %>% sample_n(sample_size)

set.seed(123)
cv_fmdow <- prediction_summary_cv(model = flat_model_dow, data = Delays_sample2, k = 5)
set.seed(123)
cv_dmdow <- prediction_summary_cv(model = default_model_dow, data = Delays_sample2, k = 5)
set.seed(123)
cv_tmdow <- prediction_summary_cv(model = tuned_model_dow, data = Delays_sample2, k = 5)

cv_fmdow <- cv_fmdow$cv %>% mutate(Model = "fmdow")
cv_dmdow <- cv_dmdow$cv %>% mutate(Model = "dmdow")
cv_tmdow <- cv_tmdow$cv %>% mutate(Model = "tmdow")

cv_results2 <- bind_rows(cv_fmdow, cv_dmdow, cv_tmdow)
print(cv_results2)
Table 3. Posterior predictive results from k-fold cross validation.

                                  MAE      MAE Scaled   Within 50%   Within 95%
Model 1: Continuous Predictor
  Flat Priors                     15.730   0.313        0.841        0.966
  Default Tuned Priors            15.779   0.314        0.840        0.966
  Tuned Priors                    15.668   0.312        0.849        0.966
Model 2: Categorical Predictor
  Flat Priors                     17.110   0.338        0.866        0.965
  Default Tuned Priors            17.080   0.338        0.866        0.966
  Tuned Priors                    17.118   0.339        0.867        0.966
3.7 Effective Sample Size
Effective sample size ratios (\(N_{\text{eff}}/N\)) for the three iterations of the first model are all well above \(0.1\).
The posterior predictive checks for the two models show that the simulated data sets (\(y_\text{rep}\)) do not accurately capture the observed values (\(y\)); the mean and variability of the predictions appear higher than those of the observed values in both models.
model <- flat_model_dow
y <- Delays_sample$ARR_DELAY  # observed outcome, assumed to match the fitted data
yrep <- posterior_predict(model)
ppc_dens_overlay(y, yrep[1:50, ]) +
  ggtitle("Figure 17 (a) Posterior Predictive Check Model 2: Flat Priors") +
  xlim(NA, 350)
Code
model <- default_model_dow
y <- Delays_sample$ARR_DELAY  # observed outcome, assumed to match the fitted data
yrep <- posterior_predict(model)
ppc_dens_overlay(y, yrep[1:50, ]) +
  ggtitle("Figure 17 (b) Posterior Predictive Check Model 2: Default Priors") +
  xlim(NA, 350)
Code
model <- tuned_model_dow
y <- Delays_sample$ARR_DELAY  # observed outcome, assumed to match the fitted data
yrep <- posterior_predict(model)
ppc_dens_overlay(y, yrep[1:50, ]) +
  ggtitle("Figure 17 (c) Posterior Predictive Check Model 2: Tuned Priors") +
  xlim(NA, 350)
3.10 OLS vs. Bayesian Regression
The results from ordinary least squares regression are not identical to the results from the Bayesian regression, but they are remarkably similar. The linear models were constructed using the lm() function from the stats package (R Core Team 2024).
Code
library(tibble)

ols_model1 <- lm(ARR_DELAY ~ DEP_TIME_MINS, data = Delays_sample)
om1t <- as_tibble(summary(ols_model1)$coefficients)

ols_model2 <- lm(ARR_DELAY ~ DAY_OF_WEEK, data = Delays_sample)
om2t <- as_tibble(summary(ols_model2)$coefficients)

om1t
om2t
This project investigated the use of different types of priors with Bayesian linear regression to understand the influence of different predictors on arrival delays for U.S. flights. Flight data was chosen as it could provide an interesting real-world application to examine the use of Bayesian regression.
4.1 Modeling Results
Model 1 examined the effect of departure time on arrival delays. All three versions of the model estimated the regression coefficient \(\beta_1\) to be \(0.02\); that is, for every minute past midnight on any given day, the arrival delay is expected to increase by \(0.02\) minutes (\(1.2\) seconds). At noon (\(X = 720\) minutes), the expected delay for any flight is \(3.48\) minutes (\(\sim 3\) minutes and \(29\) seconds). The flat, default, and tuned models estimated the intercept, \(\beta_0\), to be \(-10.92\), \(-10.94\), and \(-11.66\), respectively. The shift in the tuned model's intercept (just over 43 seconds) shows a small effect from the adjusted prior. There are no remarkable differences in the estimated standard deviations, \(\sigma\), for Model 1, and no large differences in standard error for any of the estimates.
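The noon figure follows directly from the fitted regression line; using the flat-prior estimates:
\[
\hat{y} \;=\; \beta_0 + \beta_1 X \;=\; -10.92 + 0.02 \times 720 \;=\; -10.92 + 14.4 \;=\; 3.48 \text{ minutes.}
\]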
Model 2 examined the effect of day of the week, as a categorical variable, on arrival delays. Tuesday was set as the reference level because it had the lowest average delay. The estimated intercepts, \(\beta_0\), for the three versions of the model are fairly similar, as are the estimated standard deviations. The flat and tuned models have similar estimates for the day-of-week coefficients, though the model with default tuned priors has higher weights on the Friday and Saturday coefficients and lower weights on the Sunday, Monday, and Wednesday coefficients. There are no large differences in standard error for any of the estimates.
The lack of any large differences between the models is likely due to the large amount of data, 100,000 observations, used to fit them. This indicates that the priors had a limited effect on the models in light of so much data.
4.2 Evaluation of the Models
The 5-fold cross validation yields median absolute errors of \(\sim 15.7\) for Model 1 and \(\sim 17.1\) for Model 2, indicating typical errors of \(15.7\) and \(17.1 \text{ minutes}\), or scaled errors of \(\sim 0.31\) and \(\sim 0.34\) standard deviations, for the models respectively. Cross validation also estimates that \(\sim 96.6\%\) of the observations fall within their \(95\%\) prediction intervals. This assessment provides some reassurance that the models produce accurate predictions.
The effective sample size and \(\hat{R}\) metrics also provide positive indicators for the reliability of the posterior estimates. Effective sample size ratios (\(N_{\text{eff}}/N\)) for all iterations of each model are well above \(0.1\), indicating that the Markov chains provide an appropriate volume of effectively independent samples.
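These diagnostics can be pulled directly from the fitted objects. The sketch below assumes the model objects from earlier sections (e.g. flat_model_dt) are in the workspace; neff_ratio() and rhat() are from the bayesplot package, which provides methods for rstanarm fits.

Code

library(bayesplot)

# Effective sample size ratios (N_eff / N); values well above 0.1 are desirable
neff_ratio(flat_model_dt)

# R-hat values; values near 1 indicate stability across parallel chains
rhat(flat_model_dt)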
Other MCMC diagnostics, including the trace plots, density plots of individual chains, and autocorrelations, indicated solid convergence of the Markov chains for all versions of both models. The trace plots showed noise that was higher than desirable, but the chains were well mixed with no visible trends. The density plots were fairly consistent across chains for all models, and no bimodality was observed. Autocorrelation plots showed rapid decay and weak dependence, mirroring the results from the effective sample size ratios.
The \(\hat{R}\) values for both models fall within an acceptable range such that \(\hat{R} \approx 1\), indicating stability across parallel Markov chains.
The posterior predictive checks for each prior version of the two models are all fairly similar, so they will be considered as a group, and they are concerning. The replicated data (\(y_\text{rep}\)) do not accurately capture the observed values (\(y\)): the replicated distributions are wider and shifted upward, so both the mean and the variability of the predictions exceed those of the observed data in both models. Based on the shape of the distribution of the \(y\) values, it may be appropriate to use a log transformation in the data model, i.e. \(\log(Y_i) \mid \beta_0, \beta_1, \sigma \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2)\) with \(\mu_i = \beta_0 + \beta_1 X_i\).
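A sketch of such a refit is given below, assuming the earlier models were fit with rstanarm's stan_glm(); the shift constant is needed because arrival delays can be negative (early arrivals), and both the shift and the sampler settings here are illustrative assumptions, not the original specification.

Code

library(rstanarm)

# Shift the outcome so it is strictly positive before taking logs
# (ARR_DELAY can be negative for early arrivals); the constant is illustrative
shift <- abs(min(Delays_sample$ARR_DELAY)) + 1

log_model_dt <- stan_glm(
  log(ARR_DELAY + shift) ~ DEP_TIME_MINS,
  data = Delays_sample,
  family = gaussian,
  chains = 4, iter = 5000, seed = 123
)

# Predictions would then be back-transformed to the minutes scale:
# exp(posterior_predict(log_model_dt)) - shift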
As the differences between the model versions with differing priors were slight, linear models were constructed via ordinary least squares (OLS) for comparison. The OLS models were compared to the Bayesian models with default priors. Model 1 was found to have parameter estimates similar to the OLS model, including the standard errors for those estimates. The residual standard error from the OLS model was also similar to the standard deviation of the Bayesian model. Model 2 had intercept and standard deviation estimates similar to the OLS model, though there were differences in the other parameter estimates.
In checking the model assumptions, it is relevant to examine the context of the data. The data were assembled by the Department of Transportation and include all domestic U.S. flights from January 2019 through August 2023. Each observation in the data set is an individual flight with its associated times, locations, and attributed causes for delay. The first assumption requires that each observation be independent. While we assumed this to be generally true, a fault in this logic may arise from a domino effect of late flights; this domino effect is a likely cause of the slight increase in delays throughout the day. Assumptions two and three, which require linearity and Normal, homoscedastic variance, can be assessed through the posterior predictive check. The posterior predictive checks of the models demonstrated that assumptions two and three were violated, as the predictions did not reliably capture the observations, which were not Gaussian in nature.
5 Conclusion
This analysis highlighted the diminished effect priors have on posterior distributions in light of large amounts of data. Model 1 estimated the coefficient \(\beta_1 = 0.02 \text{ minutes}\), indicating a 1.2-second increase in arrival delay for each minute past midnight and predicting, for example, a roughly three-and-a-half-minute arrival delay at noon. Model 2 indicated that delays varied by the day of the week, but with no clear trend. Comparison of Model 1 with an OLS model showed little difference, further supporting the diminished effect of priors with large data sets. There were some differences between the categorical OLS model and the categorical coefficients of Model 2, but nothing greater than the differences among the three iterations of the Bayesian model.
Diagnostics supported the MCMC sampling, with effective sample size ratios all larger than 0.1, \(\hat{R} \approx 1\), and desirable trace plots, density plots, and autocorrelations. Although the Markov chain Monte Carlo sampling provided good results, the posterior predictive check indicated issues with the model, suggesting a non-normal data model would be a better fit. A log transformation of the outcome variable may be of benefit.
The posterior predictive check also indicated a potential violation of assumptions. It may be of interest in future research to account for the domino effect that results from late flights, one of the potential issues with the first model.
References
Bayes, T. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances.”
Brilleman, SL, MJ Crowther, M Moreno-Betancur, J Buros Novik, and R Wolfe. 2018. “Joint Longitudinal and Time-to-Event Models via Stan.” https://github.com/stan-dev/stancon_talks/.
Dogucu, Mine, Alicia Johnson, and Miles Ott. 2021. Bayesrules: Datasets and Supplemental Functions from Bayes Rules! Book. https://github.com/bayes-rules/bayesrules.
Fraza, Charlotte J., Richard Dinga, Christian F. Beckmann, and Andre F. Marquand. 2021. “Warped Bayesian Linear Regression for Normative Modelling of Big Data.” NeuroImage 245: 118715. https://doi.org/10.1016/j.neuroimage.2021.118715.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” J. R. Stat. Soc. A 182: 389–402. https://doi.org/10.1111/rssa.12378.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Johnson, Alicia A, Miles Q Ott, and Mine Dogucu. 2022. Bayes Rules!: An Introduction to Bayesian Modeling with R. Chapman & Hall.
Kong, Dongdong, Junjiang Zhu, Chaoqun Duan, Lixin Lu, and Dongxing Chen. 2020. “Bayesian Linear Regression for Surface Roughness Prediction.” Mechanical Systems and Signal Processing 142: 106770. https://doi.org/10.1016/j.ymssp.2020.106770.
Lesaffre, Emmanuel, and Andrew B Lawson. 2012. Bayesian Biostatistics. 1st ed. Somerset: John Wiley & Sons, Ltd. https://doi.org/10.1002/9781119942412.
Li, Jun, Ryan Wu, I-Tai Lu, and Dongyin Ren. 2023. “Bayesian Linear Regression with Cauchy Prior and Its Application in Sparse MIMO Radar.” IEEE Trans. Aerosp. Electron. Syst. 59 (6): 9576–97. https://doi.org/10.1109/TAES.2023.3321585.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
———. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman & Hall/CRC. https://plotly-r.com.
Zhang, Xiaonan, Youliang Ding, Hanwei Zhao, and Letian Yi. 2022. “Long-term Bridge Performance Assessment Using Clustering and Bayesian Linear Regression for Vehicle Load and Strain Mapping Model.” Struct. Contr. Health Monit. 29 (12).
Index
MCMC Trace of Categorical Model With Flat Priors
Code
model <- flat_model_dow
mcmc_trace(model, pars = "(Intercept)") +
  labs(title = "Figure 13 (b) MCMC Trace", subtitle = "Intercept (Tuesday)")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKSun") + labs(subtitle = "Sunday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKMon") + labs(subtitle = "Monday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKWed") + labs(subtitle = "Wednesday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKThu") + labs(subtitle = "Thursday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKFri") + labs(subtitle = "Friday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKSat") + labs(subtitle = "Saturday")
Code
mcmc_trace(model, pars = "sigma") + labs(subtitle = "Sigma")
MCMC Trace of Categorical Model With Default Tuned Priors
Code
model <- default_model_dow
mcmc_trace(model, pars = "(Intercept)") +
  labs(title = "Figure 14 (b) MCMC Trace", subtitle = "Intercept (Tuesday)")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKSun") + labs(subtitle = "Sunday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKMon") + labs(subtitle = "Monday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKWed") + labs(subtitle = "Wednesday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKThu") + labs(subtitle = "Thursday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKFri") + labs(subtitle = "Friday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKSat") + labs(subtitle = "Saturday")
Code
mcmc_trace(model, pars = "sigma") + labs(subtitle = "Sigma")
MCMC Trace of Categorical Model With Tuned Priors
Code
model <- tuned_model_dow
mcmc_trace(model, pars = "(Intercept)") +
  labs(title = "Figure 15 (b) MCMC Trace", subtitle = "Intercept (Tuesday)")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKSun") + labs(subtitle = "Sunday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKMon") + labs(subtitle = "Monday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKWed") + labs(subtitle = "Wednesday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKThu") + labs(subtitle = "Thursday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKFri") + labs(subtitle = "Friday")
Code
mcmc_trace(model, pars = "DAY_OF_WEEKSat") + labs(subtitle = "Saturday")
Code
mcmc_trace(model, pars = "sigma") + labs(subtitle = "Sigma")
Priors for model 'flat_model_dt'
------
Intercept (after predictors centered)
~ flat
Coefficients
~ flat
Auxiliary (sigma)
~ flat
------
See help('prior_summary.stanreg') for more details